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Abstract 

Given a weighted undirected graph G = (V,E,d), the Molecular Distance Geometry Problem 
(MDGP) is that of finding a function x : G — > R 3 , where — = d(u,v) for each {u,v} 6 E. 

We show that under a few assumptions usually satisfied in proteins, the MDGP can be formulated 
as a search in a discrete space. We call this MDGP subclass the Discretizable MDGP (DMDGP). 
We show that the DMDGP is NP-complete and we propose an algorithm, called Branch-and-Prune 
(BP), which solves the DMDGP exactly. The BP algorithm performs exceptionally well in terms of 
solution accuracy and can find all solutions to any DMDGP instance. We successfully test the BP 
algorithm on several randomly generated instances. 
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1 Introduction 

It is well known that the role and function of a molecule is determined by both its chemical structure 
(the atoms that compose it and the way they bond) and its three-dimensional structure. Supposing the 
chemical structure is known, finding the conformation of the atoms in K 3 is usually tackled by a mixture 
of chemical analysis and mathematical methods. Some insight as to the molecular spatial conformation 
can be gained by employing Nuclear Magnetic Resonance (NMR) techniques, which are able to give a 
measure of the distance between (but not of the positions of) pairs of atoms closer than around 5A. The 
problem of finding the atomic positions given a subset of atomic distances can be formalized as follows. 



Molecular Distance Geometry Problem (MDGP): given a weighted undirected graph 
G = (V,E,d), is there a function x : G — > R 3 such that \\x{u) — x{v)\\ = d(u,v) for each 

{u,v} e El 



The atoms are represented by the set of vertices V, the atomic positions by x(v), for v e V, and the 
inter-atomic distance between u and v is given by d(u, v), for {u,v} G E. This problem has been shown 
to be NP-complete via a reduction from Subset-Sum [19], although the problem is solvable in linear 
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time when all the inter-atomic distances are known [6] . The MDGP is usually formulated as a continuous 
nonconvex optimization problem: 

mmg(x) = ]T (\\x(u) - x(v)\\ 2 - d(u, vff. (1) 
{u,v}eE 

Obviously, x solves the problem if and only if g(x) = 0. 

In practice the MDGP is usually solved via continuous optimization methods. In [8], the molecule 
is decomposed into clusters; each cluster's 3D structure is determined independently of the others, and 
then the clusters are recombined. In [13, 14], a Gaussian smoothing of (1) is derived in a closed analytical 
form depending on a smoothing parameter A. The proposed algorithm is called Global Continuation 
Algorithm (GCA): the smoothed problem is locally solved for iteratively increasing values of A (this 
brings the smoothed problem closer and closer to the original problem), each local solution process 
starting from the solution of the previous smoothing. In [1, 2], the MDGP is formulated as a D.C. 
(difference of convex functions) programming problems and solved using a variant of the D.C. Algorithm 
(DCA). In [10, 12], two different Variable Neighborhood Search-based algorithms arc proposed. One of 
the most stringent limitations of all these algorithms is the solution accuracy. Because there exist many 
different spatial conformations having objective function values very near zero, it is important that the 
optimal solution should have an objective value as close to zero as possible. Continuous optimization 
methods, by the very limitations of floating point arithmetics, are not well suited to produce extremely 
accurate values. Two completely different approaches to solving the MDGP are given in [11] (based on 
quantum computation) and [21] (based on algebraic geometry). 

A protein consists of a main backbone (a chain of atoms) and several "dangling" side chains. The 
NMR technique can of course be applied to proteins in particular, and indeed many of the algorithms 
to solve the general MDGP have been tested on proteins. However, proteins have a particular structure 
which makes it possible to formulate the MDGP applied to protein backbones as a discrete search prob- 
lem: this has an enormous impact on the solution accuracy, as floating point arithmetics calculations are 
fewer than with continuous search methods. We formalize this by introducing the Discretizable Molecular 
Distance Geometry Problem (DMDGP), which consists of a certain subset of MDGP instances (to which 
most protein backbones belong) for which a discrete formulation can be supplied. The determination 
of the spatial position of the side chains is called the Side Chain Placement Problem (SCPP), and 
its discrete version is known to be NP-complcte [17, 18]. Although in this paper we only consider the 
determination of the protein backbone, it is clear that given a set of likely backbones, some of them can 
be discarded if the resulting SCPP instance turns out to be infeasible. In this sense, the DMDGP and 
the SCPP are largely complementary. 



Discretizable Molecular Distance Geometry Problem (DMDGP): given a weighted 
undirected graph G = (V,E,d) such that there exists an ordering v\, . . . ,v n 6 V satisfying 
the following requirements: 

1. E contains all cliques on quadruplets of consecutive vertices: Vi € {4, . . . ,n) Vj, k £ 
{i - 3,..., i}({j, *}€£?); 

2. the following strict triangular inequality holds: d(vi-i, i>i+i) < d(v{-i,Vi) + d(vi, i>i+i), 
for i = 2, ... ,n — 1, 

is there a function x : G — * R 3 such that \\x(u) — x(v)\\ — d(u,v) for each {u,v} G El 



The distances d(vi-i,Vi) are called bond lengths, for i = 2, . . . , n, and the angles $i_2,i between atoms 
Vi-2,Vi-i,Vi are called bond angles, for i = 3, . . . , n (see Fig. 1). The ordering on V is called the backbone 
ordering. Furthermore, we partition E in two sets H and F such that H — {{i, j} e E \i — j\ < 4} and 
F = E\H. 
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In practice, Assumption 1 requires that the bond lengths and angles, as well as the distances between 
atoms separated by three consecutive bond lengths are known. The distances between atoms separated 
by two consecutive bond lengths may of course be trivially computed from the bond lengths and angles. 
Assumption 2 says that no bond angle may be a multiple of it. Assumption 1 is applicable to many 
proteins as NMR is able to compute distances of atoms which are close together, and groups of four 
consecutive atoms in the backbone ordering are usually closer than the threshold value of 5 A [4, 20]. 
Assumption 2 is also applicable to proteins as, to the best of our knowledge, no protein has bond angles 
of exactly tt. In any case, the probability measure of a protein having a bond angle of exactly n is zero. 



i-Z 




Figure 1: Definitions of bond lengths, bond angles and torsion angles. 

We propose an algorithm based on the discrete formulation, called Branch-and-Prune (BP), to solve 
the DMDGP exactly. The BP algorithm is several orders of magnitude more accurate than other existing 
algorithms, and usually also much faster. Moreover, other algorithms targeting the MDGP address the 
question of experimental errors by introducing distance bounds. As it turns out, NMR not only produces 
systematic measurement errors (which may be dealt with by introducing distance bounds) but also, more 
importantly, a non-negligible quantity of completely wrong measures [3]. The BP algorithm is able to 
account for this type of error, and allows for a certain percentage of distances to be completely wrong. 

In Section 2, we derive the discrete formulation of the DMDGP. In Section 3, we prove that the 
DMDGP is NP-complcte. In Section 4, we discuss the BP algorithm to solve the DMDGP to optimality; 
Section 4.2 shows how the BP algorithm deals with the two main types of NMR error measurements. In 
Section 5, we show computational results on some randomly generated instances. Section 6 concludes 
the paper. 



2 Discrete formulation of the MDGP 

In what follows, we will restrict our attention to the DMDGP. Notationwise, we indicate x(vi) by Xi 
and d(vi,Vj) by dij. For all i G V, the neighbourhood S(i) of i is the set {j G V \ G E} 

of vertices adjacent to i. With respect to the order < on V given by the DMDGP definition, we let 
8(i) = {j G 8(i) | j < i} for all i e V. 

In order to describe a molecule with n atoms, in addition to the bond lengths di-ij, for i = 2, . . . ,n, 
and the bond angles #i_2,i, for i = 3, . . . , n, we also have to consider the torsion angles to-is^, for i — 
4, . . . , ?i, which are the angles between the normals through the planes defined by the atoms z — 3,z — 2, i — 1 
and i — 2,i — l,i (see Fig. 1). However, in most molecular conformation calculations, all the bond lengths 
and bond angles are assumed to be known a priori. Thus, the first three atoms of the molecule can be 
fixed and the fourth atom can be determined by the torsion angle u>i t i (see Fig. 2). The fifth atom can 
be determined by the torsion angles u>x t 4 and ^2,5, and so on. 
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Figure 2: Discretization of the problem. The atom i can only be in the two shown positions (i and i') in 
order to be feasible with the distance di-3,i. 



The geometrical intuition behind the discrete formulation is that the ?-th atom lies on the intersection 
of three spheres centered at atoms i — 3,z — 2,z — 1 and of radii di-3 t i, di- 2 ,ii <2t-i,i respectively. By 
Assumption 2 and the fact that no two atoms can ever take the same position in space, the intersection 
of the three spheres defines at most two points (labeled i and i' in Fig. 2). This allows us to express the 
position of the i-th atom in terms of the preceding three, giving us 2™~ 3 possible molecules. Of course 
some of these will be infeasible with respect to the distances in F (i.e. distance between atoms which are 
further apart than 4 units in the backbone ordering), as well as with respect to other constraints (see 
Sec. 4). 

It is known that [15], given all the bond lengths dip, ■ ■ ■ , rf n _i, n , bond angles #13, . . . ,0 n -2.n, and 
torsion angles wi,4, . . . , u) n —3,n of a molecule with n atoms, the Cartesian coordinates (xi lt Xi 3 ,Xi 3 ) for 
each atom i in the molecule can be obtained using the following formulae: 
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(2) 



Bi = 



- COS 0i-2 t i 

sin 6i- 2 ,i cos w,_3,, 
sin 6»i_2.i smu}i- 3t i 




- sin Vi-2,i 

COS 9i- 2 ,i COS Ui-3 t i 
- COS 0j_2,i smui_3 ti 




-<k-l,i COs9i- 2 ,i 

- sinuisj sin6»j_ 2 ,i coswi_ 3 ^ 

cosw,„ 3 ,i di-iA sm9i-2,i sinwi_ 3 j 







1 



(3) 
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for i = 4, n. 

Since all the bond lengths and bond angles are assumed to be given in the instance, the Cartesian 
coordinates of all atoms of a molecule can be completely determined by using the values of coso*i_3,i and 
sin Wi-s,,, for i = 4, n. 

In order to state that the DMDGP can be formulated as a search in a discrete space, we need the 
following lemma. 

2.1 Lemma 

For instances of the DMDGP class, for all i = 4, ...,n, the value of coso>i_3,j can be computed 0(1). 



Proof. This follows because for every four consecutive atoms Xi-3,Xi- 2 ,Xi-i,Xi, the cosine of the 
torsion angle o>j_3,j, for i = 4, ...,n, is given by 

d t-3.t-2 + d--2,i - 2di- 3 ,i-2di-2,i COS0*_ 2 , t COS0i_l, i+ l - d?_ 3 4 

C0SO>i_ 3 ,i = ! — — — '-, (4) 

^«i-3,i-2«i-2,i Sintfj_2,i Sin 

which is just a rearrangement of the cosine law for torsion angles [16] (p. 278), and all the values in the 
expression (4) arc given in the instance. We note in passing that in order for the above reasoning to hold, 
we obviously need the denominator of (4) to be nonzero. □ 



2.2 Theorem 

Given a weighted undirected graph G — (V, E, d) associated to an instance of the DMDGP, the number of 
functions x : G — > M. 3 such that \\x(u) — x(v) 1 1 = d(u, v) for each {u, v} 6 E is finite, up to orthogonality 
transformations. 



Proof. The proof is by induction. For a molecule with 4 atoms, we can use the bond lengths dip, d 2> 3 
and the bond angle 0i,3, in order to determine the matrices B 2 and B3, defined in (2), and obtain 



xi 



X2 



X ?> 




dip + ^2,3 cos 0i i3 
d 2 ,3sin6»i i3 




fixing the first three atoms of the molecule. Since we also know the distance di,4, by Lemma 2.1 we can 
obtain the value of cos 0)1,4. Thus, the sine of the torsion angle o>i,4 can have only two possible values: 
sin 01,4 = ±yT — cos 2 0*1,4. Consequently, by (3), we obtain only two possible positions 2:4, £4 for the 
fourth atom of the molecule, given by 



Xi 



£4 



— dip + ^2,3 COS #1,3 — 6^3,4 COS 01,3 COS 02,4 + C?3,4 sin0i,3 sin02.4 C0SO>i,4 

^2,3 sin 0i,3 — rf 3 ,4 sin 0i,3 cos0 2 ,4 — c^3,4 cos 0i,3 sin 2 ,4 C0SO>i,4 

c?3,4 sin 2 ■ ^ - /n — ' 

— dip + ^2,3 COS 01,3 - d 3 , 4 COS 01,3 COS 02,4 + I 

d 2 ,3 sin 0i,3 - d 3 ■■" 



, 02,4 + ^3,4 sin 0i,3 sin : 

^3,4 Sin 01,3 COS 02,4 — J 



■ ^1,3 0111 ^2,4 u^l.4 
!,4 - COS 2 0>1, 4 ) 



COS 2 ' 

h.4 + ^3,4 sin 0i,3 sin 02,4 cos 0*1,4 

'2,4 — ^3,4 COS 0i,3 sin 02,4 COS CJi,4 
A„ . oiv, d_ . / 1 _ ™o2 77 .\ 



c?3,4 sin 2 ,4 (— yl - cos 2 0*1,4) 



Now suppose that for i > 4 atoms, we have a finite number of solutions to the DMDGP instance. 
Adding one more atom in the molecule and using Lemma 2.1 again, we can obtain the value of cos 01,-2,1+1. 



2 DISCRETE FORMULATION OF THE MDGP 



6 



From each solution of the molecule with i atoms, at most two new solutions can be obtained by using 
sinw i _ 2 ,i+i = — cos2 ^i-2,i+i m matrix given in (3). This concludes the proof. □ 

An immediate corollary is given below. 
2.3 Corollary 

For an instance of the DMDGP class with n > 4 atoms, there are at most 2 n ~ 3 possible solutions. 

Note that each possible solution of the DMDGP is defined by a sequence of torsion angles u>i 4, . . . , w„_3.„. 
By using the matrices Bi (3), this sequence can be converted into another one of Cartesian coordinates 
x = (xi, . . . ,x n ) G R 3n and, using the objective function g defined in (1), a solution can be identified 
simply by testing if g{x) — 0. 

2.1 Solution symmetry 

In this section, we show that there is a solution symmetry around the plane defined by the first three 
atoms; more precisely, any solution on one side of this plane gives rise to a symmetrical solution on the 
other side. This allows us to reduce computational costs by half. First, we need two lemmata. 



2.4 Lemma 

Let the matrix Qi be defined by 

Qi = £>4 - 

for i = 4, ...,n, where its elements are denoted by 



• Bi, 



9h 

921 
931 





Q22 
Q32 




9l3 
923 
«33 




Hi 

924 
934 
1 



If we invert the sign of sin 0^-3^ in all the matrices Bi (3), for i = 4, 
obtained by B\, then the elements of the matrix Q\, defined by 



.,n, and denote the new matrices 



is given by 



Qi 



for i — 4, n. 



Qi 


= B\- 


■B'i, 




9li 


9I2 


-9*3 


9*14 


921 


922 


-923 


924 


-931 


-932 


933 


-934 











1 



Proof. The proof is by induction. For n — 4, we obtain: 



Qi = 



— COS #2,4 —sin 6*2,4 

Sin 02,4 COS Wi,4 — COS 02,4 COS U^i 5 4 

sin 02,4 sin wi ; 4 —cos 02,4 sin wi ; 4 





—(^3,4 COS 02,4 

-sinwi,4 d 3; 4 sin 2 ,4 cos 

coswi.4 d 3j 4 sin 2 .4 sin 0^4 

1 



and 



Qa 



- COS 02.4 
12 4 COS Wi 4 

,4) 





sin l» ^ i v \ '. . _v j 1 
sin 2 .4 (— sincj li4 ) 



- sin 02.4 
— COS 02,4 cos 0*1,4 



— cos 2 .4 (— sin 0^4 







— ^3,4COS02,4 

- (— sillWi.4) G?3.4 Sin 02,4 COSW1.4 

COSWL4 G? 3 ,4 sin0 2i 4 (— sinw lj4 ) 

' 1 
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Suppose now that the assertion is valid for n = i — 1. Rewritting Qi, we get 



Qi — (B4 ■ ■ ■ Bi_i)Bi 
= Qi-iBi, 



where the elements of Qi-i are denoted by 



Q 



i-i 



and 



— COS #i_2,i 

sin 0i-2,i cos Wi_3,i 
sin 0i_2,i sinwi-a.i 




'Zn 1 


Q12 1 


lis 1 


it, 1 ' 








924 1 








934 1 











1 _ 


— sin 


9i-2,i 








COS 6i-2,i COS0JiS,i 

-cos 0i_ 2 ,i sin Wi_ 3i i 




-smwi_ 3j i 
cos Wj_3 ; j 




cos 6»i_ 2 ,i 
c?i_ M sin ^_2,i coswi_ 3 ^ 
rf<— i,i sin #i_ 2 ,i sin Wi_ 3ii 
1 



Considering the product Qj_i-Bj, we obtain 

Qi-lBi 

where 



v x y z 
0001 



v 



y = 



z = 



sir 1 

i-l 
921 

93T 1 

^r 1 

?2i 1 

93I 1 

112 1 
122 1 
Q32 1 

?3i 1 



-b) + q\^{cd)+q\^{ce) 

- b ) +1 l 22 1 ( cd ) +Q23 1 ( ce ) 

-b) + qi- 1 (cd)+q^ 1 (ce) 

-c) + q r 2 1 (-bd)+q{- 3 1 (-be) 
-c) + q^i-bd) + q2 3 l (-be) 
-c) + ql2 1 (-bd) + ql 3 - 1 (-be) 

-e) + q\ 3 - 1 (d) 



-e) + q^ 1 (d) 
-e) + qls 1 (d) 

-ab) + q l 12 1 (acd) + q l ^ 1 (ace) + q^ 1 
-ab) + q l 22 1 (acd) + q^ 1 (ace) + q^ 1 
-ab) + q^iacd) + qiz 1 (ace) + q^ 1 

and a — b — cos#,_2,i, c = sin#i_2,i, d = cos ^-3^, and e = sin 0^-3, j. 

By induction hypothesis, we have 



Q'i-i = 





912 1 


-913 1 


" 


til' 


922 1 


-923 1 




-ill 1 


-42 1 


933 1 


-93I 1 











1 



Considering the product Q' i _ 1 B' i , where 



B' 



-COS ffi-2,i 

sin 6»i_ 2 ,i cosa)i_ 3 ^ 
sin 6>i_2,i(-sincji_ 3ji ) 




- sin 0j_2,» 

- COS #i_2,i COS Wi_3,i 

-cos 0j_ 2 ,i(-sin Wi_ 3j i) 




-rfi_i ii COS0j_2,j 

(- sinwi_ 3ii ) sin6»i_ 2 ,i cosWj_ 3ii 

coswi_ 3ji di-i,i sin0i_2,i(- sinw;_ 3ji ) 

1 



we obtain 



Qi-iBi 



V X' Y' Z' 
1 
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where 



V 



x' 



Y' = 



Z' = 



9 2 7 1 ( 

922 1 ( 



b) + q {- 1 (cd)- q \- 1 (c(-e)) 
b) + q^\cd)-qi 3 1 (c(-e)) 

b) -q^\cd) + q^\c{-e)) 

c) + q r 2 i (-bd)- q r 3 i (-b(-e)) 

c) +q i -\-bd)-a^\-b{-e)) 
c)- q ^\-bd)+ q ^\-b{-e)) 

e)-q\t{d) 

e)+ q l 33 1 (d) 

-ab) + q % ^ 1 {acd) - q{ 3 1 (ac(-e)) + g^J 1 
-ab) + q % ^ 1 {acd) - q z 23 1 (ac(-e)) + q^ 1 
-ab) - quitted) + q^ 1 (ac(-e)) - q^ 1 



Representing the matrix Qi by 



Qi = Qi-iBi 



9u 9i2 lis 9i 4 

921 922 923 924 

931 932 933 934 

1 



and comparing the matrices Qi-\Bi and Q' i _ 1 B[ given above, we conclude that 



Qi = Q'i-iB'i 



111 


9*12 


-9l 3 


9l4 


921 


922 


-923 


924 


-931 


-932 


933 


-934 











1 



□ 



2.5 Lemma 

Le Xi, . . . , x n G R 3 he the Cartesian coordinates associated to the torsion angles W1.4, . . . , u> n -3 in . If we 
invert the sign of smcdis^ in all the matrices Bi (3), for i = 4, n, then the new Cartesian coordinates 
x[ , . . . , x' n 6 M 3 is given by 



< 






< 




x i2 






— X{ 3 



for i = 1, n. 



Proof. For n = 1, 2, 3 the assertion is clearly true. By the lemma above, we have 



and 



»3 
1 



= BiB2BsQi 



B\BiB- i Q\ 



B1B2B3 



= B1B2B3 



9i4 
924 

9| 4 

1 



9l4 

924 

"934 
1 
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for i = 4, n, and calculating the product B 4 B 2 B 3 , we obtain 



B1B2B3 



C0S#i ; 3 

sin 01,3 





sin 6»i ;3 

COS 01,3 






-di,2 + d 2 ,3 cos 6»i ;3 
d 2 .3 sin 0i,3 




Thus, 



and 



= B1B2B3 



B1B2B3 



13 
1 



9l4 




-di,2 + 9i4 cos 0i,3 + q 24 sin 6 


'1,3 


Fd 2 ,3 COS 0i,3 


<?24 




g| 4 sin 0i,3 - q\ 4 cosf 


?i,3 


+ d 2 , 3 sin 0i,3 


934 








-934 


1 








1 


9l4 " 




-di )2 + q\ 4 cos 0i,3 + sin 


h, 3 


+ d 2 ,3 COS 0i,3 


924 




q\ A sin 0i,3 - $ 4 cos 


9l,3 


+ d 2 ,3 sin 0i,3 










934 


1 








1 



for i 



That is, 



«2 



for i = 1, n. 



□ 



Finally, we can prove the following theorem. 
2.6 Theorem 

Consider a solution x : G — > M 3 for the DMDGP, defined by the torsion angles Wi j4 , . . . ,u> n -3,n- If we 
invert the sign of sinwj-s,, in all the matrices Bi (3), fori = 4, ...,n, we obtain a new solution x' : G — > M 3 
for the DMDGP. 

Proof. Le xi, . . . ,x n G K 3 be the Cartesian coordinates associated to the torsion angles W1.4, . . . , u n _3,„, 
Xi , . . . , a4 e K 3 be the Cartesian coordinates of the new solution obtained by inverting the sign of 
sin Wi_3,i in all the matrices Bi, for i = 4, n, and R : R 3 — > M 3 be the function defined by 

R(%il 7 , ) (*^1 7 ^2 7 ^3 ) ' 

Since i? is a unitary operator, 

11^-^-11 = H^O- 72(^)11 V(i,j)6£, (5) 

where £7 is the set of pairs of atoms (i, j) whose Euclidean distances dij are known for the solution x. 
By the Lemma 2.4., 

\\R(x i )-R(x j )\\ = \\x' i -x' j \\ V(i,j)€E. (6) 

Since x\, . . . , x n is a solution for the DMDGP, 

Iki ~ ^11 = v (*^') e ^- 

Thus, by (5) and (6), we get 

\\ x 'i - x 'j\\ = d ij v (*>i) e s > 
stating that x\ , . . . , x' n is also a solution for the DMDGP. □ 
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2.2 Undiscretizable instances 



As has been remarked, all DMDGP instances must obey a strict triangular inequality. When this does 
not hold, there may be bond angles with values kir for k € Z. By (4), the torsion angle is undefined. 
Since the torsion angle is the angle between two normal vectors to given planes, it is undefined when 
at least one of the planes is undefined. This is indeed possible if the two vectors defining the plane are 
collinear. In other words, if a bond angle is a multiple of 7T, we have the situation depicted in Fig. 3, 
where <ij_3,i is feasible for every position of atom i + 3 on the circle shown in the drawing. 







i 




















i-3 




~ ~ ~ ~ 1 *■ ' 1 


i' 



Figure 3: An instance which cannot be discretized. The i-th atom can be on any position on the circle 
shown without affecting the feasibility of the distance <ij-3,i. 

Since the set {tt} has measure in [0,2-7r], the probability that any given protein gives rise to an 
undiscretizable instance is 0. To the best of our knowledge, no protein has bond angles of exactly tt. 



3 Complexity 

In this section we show that the DMDGP is NP-complete by reducing it from the Subset-Sum problem. 

Subset-Sum. Given integers ai,...,a n , is there is a partition into two sets, encoded by 
s € { — 1, +1}™, such that each subset has the same sum, i.e. Yl?=i s(i)a,i = 0? 

The MDGP is shown to be NP-complctc in [19] (a helpful sketch of the proof is given in [13]) by 
reducing Subset-Sum to a 1-dimensional MDGP with distance constraints between successive atom (in 
an arbitrary atomic ordering) plus a single distance constraint between the first and last atom, forcing 
this distance to be zero. In the DMDGP however, we have additional distance constraints between any 
pairs of atoms 1, 2 or 3 indices apart in the atom sequence. 

3.1 Theorem 

The DMDGP is NP-complete. 

Proof. We reduce from Subset-Sum. Given an instance ai,. . . ,a n of the latter, we define an instance 
of DMDGP on 3n + 1 points numbered to 3n, with the following distance constraints: 

d it i+i = a|j/3j Vi e {l,...,3n- 1} (7) 



d iti+2 = Jd* + df +1 .. i+2 V* e {1, 3n - 2} (8) 



V?: e {1, . 


., 3n 


-1} 


V* e {l,. 


., 3n 


-2} 


e {1, . 


., 3n 


-3} 



d M+3 = + d 2 l+l l+2 + d 2 l+2 l+3 Mi £ {1, .., 3n - 3} (9) 

d , 3 n - (10) 
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Now we claim that the Subset-Sum instance is has a solution iff the DMDGP instance has a solution. 

For the easy direction, let s G {— 1,+1}" be a solution to the SuBSET-SUM-problem. We define the 
3n+l points as follows: xq = (0,0,0) and for every < i < in with i = 3k +j we set Xi — Xi-i +Skdkej, 
where eo = (1,0, 0),ei = (0, 1,0) and ei = (0,0, 1). By straightforward inspection this is a solution to 
the DMDGP instance. 

For the other direction, let us assume that the DMDGP instance has a solution x(v\), . . . , x(v n ). 
Without loss of generality we can assume that the x(vi) — (0,0,0), and that x(v2) lays on the x-axis. 
Now equation (8) implies that the bond angle between x(v\), x(u 2 ), x{v$) is ^. Again, without loss of 
generality assume that the second segment is parallel to the y-axis. By equation (9) there are only two 
possibilities for x(vi), and they force the third bond to be parallel to the z-axis. The same arguments 
apply to all other bonds, which shows that the bond (3 between Vi-i and v% is parallel to the (i mod 3)-th 
axis (where x — 0,y = 1, z = 2). Now give the bond (3 an orientation from Vi-i to Vi (which can either 
be in the same or in the opposite direction of this axis). We define a sign vector s G { — 1, +1} 3 ™, which 
encodes these orientations. In this setting, point 3n + 1 has coordinates (x, y, z) defined by 

o 

1 

2 

By equation (7) we actually have (x, y, z) — (0, 0, 0). Now let s°, s 1 , s 2 be three vectors from { — 1, +1}", 
which are s restricted to indices i mod 3 = 0, i mod 3 = 1 or i mod 3 = 2 respectively. Then any of those 
is a solution to the original Subset-Sum problem by the previous equations. □ 

It is interesting to note that Assumption 1 is, in a certain sense, the tightest possible for the problem 
to be NP-complete. Assumption 1 states that each quadruplet of consecutive vertices in the defined order 
is a clique in the distance graph. Tightening the assumption further, we might ask whether the problem 
would still be NP-complctc if each quintuplet of consecutive vertices were a clique. This, however, fails 
to be the case. A trilateration graph in R D is a graph with an order (v\, . . . , v n ) on the vertices such for 
all vertices Vi with i > D + 1, {j, i] S E for all j G {i — D — 2, . . . , i — 1} (i.e. each vertex is adjacent to the 
preceding D + 1 vertices) . In three-dimensional space, this implies having distances to at least 4 vertices 
earlier in the order, which means having a clique for each consecutive quintuplet. By [7] (Theorem 9), 
the MDGP problem associated to a trilateration graph can be solved in polynomial time. 



4 Branch-and-Prune algorithm 

In this section we present a Branch-and-Prune (BP) algorithm for the DMDGP. The approach mimicks the 
structure of the problem closely: at each step we can place the i-ih atom in two possible positions Xi , x^. 
However, either or both of these positions may be infeasible with respect to a number of constraints. 
The search is branched on all atomic positions which are feasible with respect to all constraints; by 
contrast, if a position is not feasible the search scope is pruned. In this context we call the feasibility 
verifications pruning tests. The simplest (but very effective) type of these is the Direct Distance Feasibility 
(DDF) pruning tests: for all distance pairs {j, i} G F (with j < i — 4, see Sec. 1, p. 2) we check that 
(\\xj — Xi\\ 2 — d 2 ^) 2 < e, where e > is a given tolerance. If the inequality does not hold, we prune the 
search node. 

The BP algorithm is therefore an algorithmic framework whose definition is completed by cxpliciting 
the pruning tests. These can be of geometrical or of physical-chemical nature. These can be of geometrical 
or of physical-chemical nature. Apart from the DDF pruning tests, many other tests are possible; a few 



i mod 3— 

y= E 

i mod 3— 

*= E 

i mod 3— 
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of them are discussed below. This algorithm, as described, will find all solutions to the problem. If we 
are interested in one solution only, we can stop the search as soon as we have placed the last atom in a 
feasible position. 

4.1 Algorithmic Framework 

Let T be a graph representation of the search tree. Initially, T is initialized to the search nodes 1 — > 
2 — > 3 — > 4 (no branching) since the first three atoms can be fixed to feasible positions x\, X2,x^ and the 
fourth atom x 4 can be fixed to any of its two possible positions by Theorem 2.6. By the current rank 
of the search tree we mean the index of the atom being placed at the current node. At each search tree 
node of rank i we store: 

• the position Xi <E R 3 of the i-th atom; 

• the cumulative product Qi = YVj=i Bj 01 the torsion matrices; 

• a pointer to the parent node P(i); 

• pointers to the subnodes L(i),R(i) (initialized to a dummy value PRUNED if infeasiblc). 

Notice that the edge structure of the graph T is encoded in the operators PQ, L(), R() defined at each 
node. The recursive procedure at rank i — 1 is given in Algorithm 1. Let y = (0, 0, 0, 1) , e > a given 
tolerance and v a node with rank i — 1 in the search tree T. 

Algorithm 1 BP algorithm. 
0: BranchAndPrune(T, v, i) 
if (i < n — 1) then 

Compute the possible placements for i-th atom: 
calculate the torsion matrices Bi,B[ via Eq. (3); 

retrieve the cumulative torsion matrix Qi-\ from the parent node P(v); 

compute Qi = Qi-\B U Q\ = Qi-\B[ and Xi,x' { from Qiy,Q' t y; 

let A = 1, p = 1; 

Pruning tests: 

if (xi is feasible) then 

create a node z, store Qi and Xi in z, let P(z) = v and L{v) = z; 

set T ^TU{z}; 

BranchAndPrune(T, z, i + 1); 
else 

set L(v) = PRUNED; 
end if 

if (x'i is feasible) then 

create a node z', store Qi and Xi in z', let P(z) = v and R(v) = z'\ 

set T ^TU{z'}; 

BranchAndPrune(T, z', i + l); 
else 

set R(v) = PRUNED; 
end if 
else 

Rank n reached, a solution was found: 

solution stored in parent nodes ranked n to 1, output by back-traversal; 
end if 
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4.2 Pruning tests and error tolerance 

There are two types of experimental errors arising from NMR distance measurements: (i) systematic 
uncertainty on each measurement, and (ii) a certain percentage of completely wrong measurements [3]. 
Errors of the first type are usually dealt with by introducing distance bounds [14], which the BP algorithm 
can take into account without any problem. To the best of our knowledge, errors of the second type have 
only been tackled by the Error Correcting Code (ECC) proposed in [3]. Naturally, this ECC can (and 
should) be applied to the protein backbone distance matrix as a preprocessing step to running the BP 
algorithm. On top of this, however, many of the pruning tests are "natively" suited for attempting to 
correct this type of error probabilistically, if a measure of the infcasibility is provided by the test. We 
only show here how to adapt the DDF tests for the two types of NMR errors. 

If we consider distance bounds like < dji < d^ for each {j, i) G F we simply have to modify the 
pruning tests as follows. Placing atom i at search node v, for j 6 5(i) PI F, 

1. for L(v): if d^ < \\xj — x,|| < d^ then Xj is feasible, else it must be pruned; 

2. for R(v): if d^ < \\xj — x£|| < d^ then x\ is feasible, else it must be pruned. 

As for the errors of the second type, let lOOp (with p e [0,1]) be the known average percentage of 
completely wrong measurements. We deal with these in a probabilistic way: suppose we are positioning 
the i-th atom in position Xj, w.l.o.g. at the left node L(v) (the reasoning for the right node is the same). 
A distance dji is infeasible for Xj if the corresponding Pruning test for L(v) fails. We prune L(v) 
from the search tree only if more than p\5(i) (~l F\ distances are infeasible for L(v). The downside of this 
method is that it may introduce some false positives in the solution set. 

4.3 Euclidean bounds pruning tests 

These tests employ the fact that inter-atomic distances are assumed to be Euclidean. Much like the 
pruning of the search scope in point-to-point Dijkstra shortest-path searches on Euclidean graphs, we 
can prune away an atomic position i if it is too far with respect to the given distances. Consider atoms 
h,i,k with h < i < k such that {h, k} G E (so that dhk is known). Assume that the BP has already 
placed atom h, and that we are now verifying feasibility for atom i. Let D(i, k) be an upper bound to 
the distance ||xj — Xfc|| for all possible immersions x : G — > R 3 which are feasible DMDGP solutions. 

4.1 Lemma 

If D(i, k) < \ \xu — Xi\ \ — d}^ for all feasible x : G — > R 3 , then the BP search node for atomic position Xj 
can be pruned. 

Proof. Suppose, to get a contradiction, that position x, is feasible for the DMDGP instance being solved. 
By definition, D(i,k) > ||x, — Xfe||. Since distances are Euclidean, ||xj — Xfe|| > \\xh — Xj|| — ||x/j — Xfc||. 
Hence D(i, k) > \ \xu — x,|| — dhk > D(i, k), which is a contradiction. □ 

By Prop. 4.1, every upper bound D(i,k) to the distance \\xi — Xk\\ provides a valid pruning test. 
Furthermore, in all Euclidean graphs the Euclidean distance between two vertices is a lower bound to the 
cost of all paths joining the two vertices in the graph. We therefore let D(i, k) be the cost of the shortest 
path from i to k in G, which provides a valid pruning test. 
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4.4 Continuous optimization-based pruning tests 

Global optimization techniques for the smooth nonconvex formulation (1) of the MDGP [8, 13, 14, 10, 12] 
can be employed to verify that the current backbone is feasible. As most of these techniques arc usually 
computationally expensive, this type of pruning tests may only be performed once in a while. Although 
the GCA [13, 14] scores the best computation times, it usually provides solutions of rather low accuracy, 
so it may not be the best candidate. 



4.5 Physical-chemical pruning tests 

As mentioned in the introduction, a SCPP can be solved for every backbone in order to try to place the 
side-chain residues onto the backbone. If the SCPP is infeasible, this means that the backbone is also 
infcasible. Therefore, for every partial backbone we can try to solve the associated SCPP to attempt to 
prune some search branches. Since the SCPP is NP-complete (hence its solution may be computationally 
very expensive), this type of pruning tests should only be performed once in a while. 

Other types of physical-chemical pruning tests can be devised on a per-molcculc basis. 



4.6 Detailed example 

In this section we discuss the application of Algorithm 1 to a simple example (artificially generated as 
explained in [9], also see Section 5.3). 

The instance in question (called lavorll_7), with all bond angles set to 1.91 radians, has 11 atoms: 

(5(2) UF = {9}, dl = (3.387634917) 

6(3) UF = {8, 9, 10}, rff = (3.96678038, 3.003368265, 3.796280236) 

<5(4) UF = {8, 9, 10}, = (2.60830758, 2.102385055, 3.159309539) 

(5(5) U F = {9, 10}, = (2.689078459,3.132251169) 

(5(6) U F = {10}, dl = (3.557526815) 

(5(7) U F = {11}, rff = (3.228657023). 

The distances in H are of course 8(i)UH = {i + 1, i + 2, i + 3} for alH < n — 3, 5{n — 2) UH = {n — 1, n}, 
5(n — 1) U H = {n}. The vector of the distances in H is: 

d H = (1.526,2.491389536,3.83929637, 
1.526, 2.491389536, 3.831422399, 
1.526, 2.491389536, 3.835602674, 
1.526, 2.491389535, 3.030585263, 
1.526, 2.491389534, 2.899348439, 
1.526, 2.491389535, 3.086914764, 
1.526, 2.491389536, 2.788611167, 
1.526, 2.491389536, 2.888815709, 
1.526,2.491389537, 
1.526). 

As can be seen from the BP tree given in Fig. 4 (this is actually the output of Algorithm 1 on the given 
instance), this instance has four solutions: the leaf nodes at rank 11 — the rank is given by the number 
of the leftmost node in each row. Notice that the earliest node when some pruning occurs is at rank 7, i.e. 
no pruning occurs before the placement of the 8-th atom. This happens because there are no distances 



5 COMPUTATIONAL RESULTS 



15 




Figure 4: The BP tree of the instance of Section 4.6. 

{j, k} £ F with k < 8, so each position for atoms with index i < 8 is feasible (by construction of Xi, x^) 
with the distances in F. Again, there is pruning at ranks 8, 9, 10, i.e. during the placement of atoms 9, 
10, 11, because there are distances {j, k} £ F with k = 9, 10, 11. One of the solutions is shown in Fig. 5. 




Figure 5: One of the possible solutions of the lavorll_7 instance. 



5 Computational Results 

In order to test the viability of the proposed method, we tested a class of randomly generated MDGP 
instances described in [9] . We present comparative results of BP (where only the DDF pruning tests have 
been implemented) and another existing MDGP software called dgsol implementing the GCA [14]. It 
turns out that BP is always superior to dgsol for solution accuracy, generally superior as regards speed, 
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and inferior as regards memory requirements. It is fair to remark here that the GCA is able to solve the 
MDGP, whereas BP is limited to solving the DMDGP only. In this sense the present comparison is not 
completely fair to GCA. 

5.1 Software testbeds 

The software code dgsol [14] (version 1.3) can be freely downloaded from 

http : //www.mcs . anl . gov/~more/dgsol/. 

The algorithm implemented by the dgsol code is very different from ours. First, it targets a more general 
problem class: the Molecular Distance Geometry Problem with Distance Bounds. In this problem, lower 
and upper bounds to atomic distances are known, rather than the exact atomic distances. Since these are 
usually estimated through NMR techniques, it is realistic to assume that there is an experimental error 
in the measurements (our approach does not consider this issue yet). Secondly, dgsol needs to make 
no assumption whatsoever about the distances of triplets and quadruplets of consecutive atoms being 
known. Thirdly, dgsol is based on a continuous smoothing of the original problem to a form which has 
fewer local minima. An ordinary NLP optimization method is then applied to the modified problem, and 
the optimum is traced back to the original problem. This is a fully continuous optimization algorithm, 
whereas BP is a discrete method. 

It turns out that the main advantages of BP over dgsol are: 

1. tractability of larger instances; 

2. higher solution accuracy; 

3. BP can potentially find all feasible solutions, not just one. 
By contrast, the main advantages of dgsol over BP are: 

1. it targets a larger class of problems; 

2. its running time seems to increase very slowly (and regularly) as a function of the number of atoms 
in the molecule, at least when the set of given distances is comparatively small; 

3. the amount of memory needed to complete the search is negligible. 

The BP algorithm behaves very unpredictably with respect to the amount of needed memory, some- 
times requiring over 1GB RAM for relatively small molecules (40 atoms), sometimes solving 1000-atoms 
instances in a few seconds and very little memory. 

5.2 More-Wu instances 

The More-Wu instances are finite three-dimensional hypercubic lattices with s 3 atoms, as shown in Fig. 6. 
The bond lengths parallel to the coordinate axes are assumed to be 1. By providing the instances with 
the obvious atomic ordering (as shown in Fig. 6) and the bond angles, we can make them amenable to 
the application of our method. However, since many of the bond angles $ are equal to n (e.g. the angle 
between atom 1 and atom 3 in Fig. 6), these instances are undiscretizable (see Sect. 2.2). In particular 
we get sini9 = 0, so Eq. (4) ceases to hold. 
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Figure 6: The s = 3 More-Wu instance with 27 atoms. 

In order to test these instances, we perturbed the lattice points X{ — {x i i 1 x i 2 1 x a ) in the following 
way: 

V^ < s 3 (i mod 3 = => (x a <- x, 3 + (-1)*»7))> (H) 

where r\ was taken to be 0.25. This gave rise to instances which we call "modified More-Wu instances" 
(mmorewu-s, where s 3 is the number of atoms in the molecule). An example is shown in Fig. 7. It is 
worth mentioning that as the original More-Wu instances describe a molecular structure rarely, if ever, 
found in proteins, we feel our perturbation does not alter crucial molecular characteristics. We generated 



♦ z 




Figure 7: The modified More-Wu instance with 27 atoms. The dashed arrow indicates the atomic 
ordering. 

all modified More-Wu instances from s = 2 to s = 6. 
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5.3 Lavor instances 

These instances, described in [9], are based on the model proposed by [15], whereby a molecule is repre- 
sented as a linear chain of atoms. Bond lengths and angles are kept fixed, and a set of likely torsion angles 
is generated randomly. Depending on the initial choice of bond lengths and angles, the Lavor instances 
give rather more realistic models of proteins than other randomly generated instances do (see for example 
the instances described in [14]). Fig. 5 gives an example of a Lavor instance. In the numerical tables, 
we labelled the Lavor instances by lavorn-m, where n is the number of atoms in the molecule and m is 
an instance ID (since there is a random element of choice in the generation of the Lavor instances, many 
different instances can be generated having the same atomic size). 

We generated 10 different Lavor instances for each size n = 10, ... ,70 ("small set"), and 4 different 
Lavor instances for each size n in {100i|l < i < 10} ("large set"). 

5.4 Hardware and memory considerations 

All tests have been carried out on an Intel Pentium IV 2.66GHz with 1GB RAM, running Linux. The 
code implementing the BP algorithm has been compiled by the GNU C++ compiler v. 3. 2 with the -02 
flag. As mentioned above, BP can be very memory-demanding. We deliberately took the choice of 
employing a low-end PC with just 1GB RAM to show just how powerful this technique can be even with 
modest hardware. 

The BP algorithm is in general very fast, since all it does is testing feasibility with the known distances 
at each branched node. However, exploring the search space may require a lot of memory, especially if no 
pruning occurs early in the run. Consequently, when the physical RAM of the test machine is exhausted, 
and the operating system starts swapping to disk, the total CPU elapsed time size becomes unmanageable. 
Thus, it was decided to kill all jobs requiring more than 1 GB RAM. In particular, we solved almost all 
the Lavor instances in the "small set" and found one solution for each of the Lavor instances in the "large 
set". 

5.5 Comparative results 

The full results table for the complete test suite includes 655 instances and spans 14 pages: thus, only 
a sample will be presented in detail. The averages, however, are taken with respect to the whole suite. 
The e parameter in the DDF pruning tests was set to 1 x 10~ 3 for all tests. Table 1 contains detailed 
results for the sample. The instances are described by their name, their atomic size n and the number 
of given distances \S\. Note that in order to use dgsol, the lower and upper bounds to these distances 
were set to ±5 x 10 -4 . Other than this, dgsol was used with all default parameter values. The results 
refer to three methods: dgsol, BP stopped after the first solution was found (BP-Onc), and BP run to 
completion (BP-AU). For dgsol and BP-One, the user CPU time (in seconds) was reported, as well as 
the Largest Distance Error (LDE), defined as 



employed as a measure of solution accuracy (the lower, the better). For the (BP-AU) method, we reported 
the user CPU time and the number of solutions found (#Sol). Missing values are due to excessive memory 
requirements (over 1GB RAM). 

It is immediately noticeable that whereas dgsol always finds a solution, BP sometimes fails to find 
one within 1 GB RAM. It is instructive, however, to look at the solution accuracy (taken over the whole 
test suite): whereas dgsol ranges from 4.5 x 10~ 7 to 0.875 (excepting a couple of out-of-scale values 
clearly due to some numerical instability), BP scores a rather more impressive 4.74 x 10~ n to 5.62~ 6 . 
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Instance 


dgsol 


BP-Onc 


BP-A11 


Name 


n 


\s\ 


CPU 


LDE 


CPU 


LDE 


CPU 


#Sol 


mmorewu-2 


8 


28 


0.02 


2.63E+5 


0.00 


4.37E-10 


0.00 


2 


mmorewu-3 


27 


331 


0.23 


6.99 


0.00 


2.97E-09 


0.00 


2 


mmorewu-4 


64 


1882 


0.67 


7.79E-6 


0.00 


5.56E-09 


0.00 


4 


mmorewu-5 


125 


7105 


2.94 


3.54E-6 


0.00 


1.67E-08 


0.01 


4 


mmorewu-6 


216 


21461 


18.65 


0.032 


0.02 


4.91E-08 


0.03 


4 


lavorlCLO 


10 


33 


0.02 


1.57E-5 


0.00 


5.36E-10 


0.00 


4 


lavorl5_0 


15 


57 


0.10 


4.04E-5 


0.00 


2.84E-09 


0.00 


16 


lavor20_0 


20 


105 


0.14 


2.77E-5 


0.00 


6.13E-09 


0.00 


8 


lavor25_0 


25 


131 


0.84 


1.18E-4 


0.00 


1.38E-09 


0.00 


8 


lavor30_0 


30 


169 


0.40 


1.75E-5 


0.00 


1.23E-09 


0.00 


2 


lavor35_0 


35 


171 


0.81 


9.33E-5 


0.00 


1.52E-09 


0.00 


64 


lavor40_0 


40 


295 


2.84 


0.096 


0.00 


2.87E-09 


0.00 


2 


lavor45_0 


45 


239 


3.33 


0.170 


0.00 


6.92E-09 


0.00 


2 


lavor50_0 


50 


271 


3.45 


0.696 


0.00 


3.96E-08 


0.46 


4096 


lavor55_0 


55 


551 


5.80 


0.257 


0.00 


2.66E-09 


0.00 


64 


lavor60_0 


60 


377 


5.15 


0.049 


0.00 


3.51E-09 


0.00 


64 


lavor65_0 


65 


267 


2.61 


0.065 


0.00 


7.76E-10 


- 


- 


lavor70_0 


70 


431 


8.73 


0.107 


0.02 


1.64E-09 


- 


- 


lavorl00_2 


100 


605 


6.95 


0.167 


2.26 


4.01E-09 






lavor200_2 


200 


1844 


63.52 


0.395 


0.00 


5.66E-08 






lavor300_2 


300 


2505 


100.99 


0.261 


0.03 


1.56E-08 






lavor400_2 


400 


2600 


182.21 


0.767 


0.01 


3.35E-09 






lavor500_2 


500 


4577 


329.29 


0.830 


0.27 


4.69E-07 






lavor600_2 


600 


5473 


299.76 


0.700 


0.01 


4.94E-08 






lavor700_2 


700 


4188 


281.34 


0.569 


0.16 


1.83E-06 






lavor800_2 


800 


6850 


570.20 


0.528 


3.34 


3.37E-06 






lavor900_2 


900 


7965 


550.26 


0.549 


3.08 


5.62E-06 






lavorl000_2 


1000 


8229 


844.52 


0.695 


0.81 


2.04E-06 







Table 1: Computational results. Missing values are due to excessive memory requirements (> 1GB 
RAM). 

On average, the solution accuracy obtained by dgsol is 9.55 x 10~ 2 whereas BP averages 4.56 x 10~ 8 . 
Furthermore, all the instances in the Lavor "large set" are solved by dgsol to a solution accuracy of 
order lO^ 1 : given that in BP pruning often occurs for feasibility differences of order 10 _1 and even 10~ 2 , 
such a slack solution accuracy may mean that dgsol is not actually finding the correct solution. 

Table 2 reports the averages of the same parameters as in Table 1 taken over 10 Lavor instances in 
a sample of the "small set" and over 4 Lavor instances in a sample of the "large set" . It appears clear 
from these data that BP's strong points are indeed speed and accuracy. A graphical representation of 
the averages taken over the whole Lavor test set is shown in Fig. 8 (user CPU average taken to solve the 
instances in function of the molecular size by dgsol and BP-One) and Fig. 9 (average accuracy of the 
solution attained by dgsol and BP-One). Notice the huge y-axis scale difference in the two pairs of plots 
(around 300 times smaller in favour of BP for CPU and around 30000 times smaller in favour of BP for 
accuracy). 

5.6 The number of solutions 

It is remarkable that in Table 1 BP- All always finds a number of solutions which is a power of 2. Although 
we were not able to ascertain the exact reason why the More-Wu or Lavor instances had these properties, 
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Instance 


dgsol / avg. 


BP-One / avg. 


BP-A11 / avg. 


n 


CPU 


LDE 


CPU 


LDE 


CPU 


#Sol 


10 


0.03 


4.40E-01 


0.00 


1.19E-09 


0.00 


1.54E+01 


15 


0.08 


1.96E-02 


0.00 


1.23E-09 


0.00 


3.72E+01 


20 


0.23 


3.20E-03 


0.00 


1.94E-09 


0.00 


6.90E+01 


25 


0.56 


1.58E-02 


0.00 


1.58E-09 


0.02 


1.14E+02 


30 


0.65 


1.03E-02 


0.00 


3.45E-09 


0.01 


2.65E+02 


35 


1.10 


5.43E-02 


0.00 


2.84E-09 


0.10 


3.35E+03 


40 


1.41 


2.61E-02 


0.00 


5.75E-09 


0.02 


8.48E+02 


45 


2.13 


5.80E-02 


0.00 


6.25E-09 


0.12 


2.48E+03 


50 


2.54 


1.65E-01 


0.00 


6.62E-09 


0.16 


1.80E+03 


55 


4.10 


7.29E-02 


0.00 


5.53E-09 


0.03 


4.28E+02 


60 


4.47 


1.59E-01 


0.00 


6.44E-09 


0.04 


3.49E+02 


65 


4.64 


1.16E-01 


0.00 


8.37E-09 


1.21 


3.80E+03 


70 


7.63 


9.28E-02 


0.01 


1.07E-08 


- 


- 


100 


10.57 


3.53E-01 


0.57 


2.46E-09 






200 


57.34 


3.61E-01 


0.02 


2.00E-08 






300 


109.91 


4.03E-01 


0.03 


1.90E-08 






400 


173.54 


6.69E-01 


0.10 


1.05E-08 






500 


273.66 


6.19E-01 


0.16 


4.92E-07 






600 


351.15 


5.75E-01 


0.01 


5.47E-08 






700 


365.37 


7.03E-01 


0.82 


2.65E-06 






800 


583.65 


6.54E-01 


2.72 


1.90E-06 






900 


714.39 


6.88E-01 


1.68 


2.85E-06 






1000 


787.30 


6.88E-01 


0.41 


1.45E-06 







Table 2: Average statistics for Lavor instances (over 10 instances for the set of small instances and over 
4 for the set of large instances) . 

we were able, through Theorem 3.1, to ascertain that this behaviour does not apply to all instances in 
the DMDGP. 

5.1 Lemma 

The instance I = {101, 102, 104, 108, 1001, 1002, 1004, 1008} to the Subset-Sum problem has exactly 3 
solutions. 

Proof. We denote by S, S the partition of I solving the Subset-Sum problem. Let 1008 be in S, then 
exactly one of {1001, 1002, 1004} must also be in S, and the other two in S This force set membership 
of 101, 102, 104, 108 in the following way: if 1000 + x is in 5, then 100 + x is in S and vice versa, for all 
x e {1,2,4,8}. □ 

By the above Lemma and the proof of Theorem 3.1, there is a DMDGP instance with 2 x 27 solutions 
(the 2 factor is due to Theorem 2.6). 



6 Final Remarks 

In this paper we formally define an NP-complcte subclass of the Molecular Distance Geometry Problem, 
related to proteins, for which a discrete formulation can be supplied. Instances of this class can be solved 
by employing a Branch-and-Prune algorithm which makes it possible to find very efficiently one or all 
the solutions to the problem instance. Furthermore, typical NMR measurement errors can be taken into 
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Figure 8: Average user CPU time (plotted against molecular size) taken by dgsol (top) and BP-One 
(bottom). 

account by the algorithm. We illustrate the performance of our algorithm on a set of randomly generated 
instances. 
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